Digital Transformation by Mark Baker

Author: Mark Baker
Language: eng
Format: mobi, azw3, epub, pdf
Published: 2014-12-19T08:00:00+00:00


Figure 17 The popular view at present among data scientists is that algorithms are pivotal to success. Here we illustrate the principle that they are part of the chain.

The popular belief is that you can send an algorithm over raw data and have insights pop up. Problems typically arise as new data tables are added ad hoc, or even at each agile sprint, and as new analyses are created without any backward compatibility: “we couldn’t have known we needed them ahead of time”. The excuse shows poor competence, as nothing prevented an expandable structure from being designed in the first instance except the philosophy “we didn’t know that we would ever have to change”. The chance to analyse huge, valuable archives of historical data is dismissed with the mantra “it’s not that simple” when what was really meant was “I didn’t think ahead” or, worse, “I refused to think ahead”.
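One way to picture an "expandable structure" is a record with a small, stable core and a versioned area for fields that arrive in later sprints. The sketch below is only an illustration under stated assumptions: the names `Record`, `extras`, `schema_version` and `session_seconds` are hypothetical and do not come from the book.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Record:
    """One stored observation: a stable core plus open-ended extras."""
    record_id: str
    timestamp: str                 # ISO 8601 text, fixed from the first release
    schema_version: int = 1        # bumped whenever new fields are introduced
    extras: Dict[str, Any] = field(default_factory=dict)  # fields added later


def session_length(record: Record) -> Optional[float]:
    """An analysis written after the first release.

    Older records simply lack the field, so the function reports None
    instead of failing or inventing a value.
    """
    value = record.extras.get("session_seconds")  # hypothetical later field
    return None if value is None else float(value)


# A record written before the field existed, and one written after.
old = Record("r1", "2014-01-05T10:00:00")
new = Record("r2", "2014-11-20T09:30:00", schema_version=2,
             extras={"session_seconds": 42.5})

lengths = [session_length(r) for r in (old, new)]
```

Because the core never changes, analyses written against the first release keep working on the historical archive, while newer analyses can tell from `schema_version` which records are able to answer them.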

You can use software to clean the data, but the situation is analogous to the signal-to-noise ratio in an audio system: just as in hard science, the real solution is to design things so that you get clean data from the start. GIGO (garbage in, garbage out) counts more in big data than ever before[41]. Having to use data cleaning of any sort means that you have put the wrong figures in. The only acceptable solution is to put the right ones in. Any technological fix is worse than a kludge[42]; it is a bodge.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" . . . I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage, Passages from the Life of a Philosopher
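The idea of putting the right figures in, rather than cleaning the wrong ones afterwards, can be sketched as a check at the point of entry. This is a minimal illustration, not a method described in the book: the function `accept_reading`, the exception `RejectedRecord`, the field names and the plausible range of -50 to 60 are all assumptions made for the example.

```python
from datetime import datetime


class RejectedRecord(ValueError):
    """Raised when a reading is refused at the point of entry."""


def accept_reading(raw: dict) -> dict:
    """Validate a reading before it is stored.

    A malformed reading is rejected outright, so nothing needs to be
    cleaned downstream.
    """
    try:
        when = datetime.strptime(raw["timestamp"], "%Y-%m-%dT%H:%M:%S")
        value = float(raw["value"])
    except (KeyError, TypeError, ValueError) as exc:
        raise RejectedRecord(f"malformed reading: {raw!r}") from exc

    # Assumed plausible range for this hypothetical sensor; NaN also fails here.
    if not (-50.0 <= value <= 60.0):
        raise RejectedRecord(f"value out of range: {value!r}")

    return {"timestamp": when, "value": value}


good = accept_reading({"timestamp": "2014-12-19T08:00:00", "value": "21.5"})
```

The design choice is to fail loudly at the source: the person or system supplying the bad figure learns about it immediately, instead of an analyst discovering it months later in the archive.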

Monica Rogati, vice president for data science at Jawbone, gives us an idea of how big and widespread the problem is when she says that “Data wrangling is a huge — and surprisingly so — part of the job,” going on to say “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

The magnitude of the problem is reiterated by Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco: “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

Big data typically has to run on a parallel, distributed system to provide adequate storage and processing capacity. The most common current way to implement this is a cloud architecture. A Chief Digital Officer needs a good understanding of parallel data systems and of at least one scalable cloud platform. In contrast to more traditional, low-level high-performance computing models that interact directly with parallel and distributed hardware, such as OpenMP for shared-memory systems and MPI for distributed-memory systems, users typically interact with the cloud at a higher level, using tools such as Hadoop or by deploying server farms.
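To give a feel for what "a higher level" means, the sketch below imitates the map, shuffle and reduce steps of the MapReduce style that Hadoop popularised, using only the Python standard library. It is not Hadoop's API and it runs on a single machine; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical. The point is that the user writes only the map and reduce functions, while a real framework would take care of distributing them across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """User-written: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Framework's job: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """User-written: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data big cloud", "cloud data"])
```

Nothing here mentions threads, message passing or memory layout, which is the contrast with OpenMP and MPI: the parallelism is left to the platform rather than written by the user.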


